We show that a protein can be trained to recognise multiple
conformations, analogous to an associative memory, and provide
capacity calculations based on energy fluctuations and
information theory. Unlike the linear capacity of a Hopfield
network, the number of conformations which can be remembered
by a protein sequence depends on the size of the amino
acid alphabet as $\ln A$, independent of protein length. This
admits the possibility of certain proteins, such as prions, evolving
to fold to independent stable conformations, as well as
novel possibilities for protein and heteropolymer design.

It is widely thought to be a design feature of real proteins
that their native, biologically active state both is a
deep global energy minimum and has a funnel of low energy
configurations leading toward it. The deep well
ensures that a significant fraction of protein molecules 
occupy the native state at any given moment. The fun- 
nel guides the molecule to fold to its stable native con- 
formation in a time much less than that required for it 
to explore all configurations, thus avoiding the so-called 
Levinthal paradox. 

Inverse protein folding, or protein design, consists of 
designing a sequence of amino acids that stably and 
quickly folds to a desired target conformation. This pro- 
cess may be expressed in the context of the energy land- 
scape, to which each sequence corresponds. For each 
compact conformation $\Gamma_c$, there are typically a myriad
of sequences which fold to it. The set of sequences
which fold to $\Gamma_c$ corresponds to those energy landscapes
whose global minima lie above the target. Most of these
will possess nominally global (shallow) minima and fold
in very long rather than biological time scales. Of
those which are deep, and hence thermodynamically sta- 
ble, fewer yet will resemble broadly sloping funnels. It 
is this last group of energy landscapes, and hence se- 
quences, to which natural proteins are believed to cor- 
respond. Not surprisingly, we wish to select for similar 
features when engineering artificial proteins. 

In this sense, protein design corresponds to choosing 
from the spectrum of all possible sequences a sequence 
whose landscape possesses the attributes we desire. Be- 
cause the spectrum is finite, however, we are not free to 
insist on an arbitrary topography; some landscapes have 
wells too deep or too numerous to be practicable. 

In this Letter we investigate the fundamental limit 
on the introduction of deep (thermodynamically stable) 
minima into the protein energy landscape. We estimate
the typical maximum depth of the ground state
well in a sequence trained to fold to a unique conformation.
By analogy with the theory of associative neural
networks (ANNs), we show how protein design can
be generalised to provide recognition of several confor- 
mations rather than a single target state. We find that 
the number of conformations that a protein can recall is 
limited and calculate its capacity. Remarkably, the ca- 
pacity depends not on protein length but on the number 
of amino acid species. 

The ability of a protein sequence to encode multiple
conformations has immediate implications for our understanding
of prions and other multi-stable proteins. In his
Nobel lecture, Prusiner concludes 'The discovery that
proteins may have multiple biologically active conforma- 
tions may prove no less important than the implications 
of prions for diseases. How many different tertiary struc- 
tures can [a protein] adopt? This query not only ad- 
dresses the issue of the limits of prion diversity but also 
applies to proteins as they normally function within the 
cell. . . .' In addition to predicting multi-stable proteins, 
our results suggest that artificial heteropolymers may be 
engineered to fold to multiple targets as well. We discuss 
possibilities for implementing target control to this end. 

Our thermodynamic capacity result — that the ma- 
nipulation of the energy landscape by the introduction 
of deep minima is limited — can be generalised. We in- 
vestigate the kinetic capacity of a protein, i.e., the limit 
on the size of a folding funnel, in a separate Letter.

Proteins as Associative Memories. A lattice protein
consists of a sequence $S$ of $N$ amino acids, or
monomers, each of which can take on one of $A$ possible
species. We denote the species of the $i$th monomer
of $S$ by $S_i$, and monomers $i$ and $j$ interact according to
the $N \times N$ extended pair potential $\tilde{U}$, where $\tilde{U}_{ij} = U_{S_i S_j}$
and $U$ is the $A \times A$ pair potential.

Protein conformations may be represented by the contact
matrix $C$, where $C_{ij} = 1$ if monomers $i$ and $j$ are
nearest neighbours and 0 otherwise. Contacts between
monomers adjacent along the protein chain are preserved
and cannot influence the folding dynamics, so we exclude
these from the contact map. For compact conformations,
each interior monomer is surrounded by its chain neighbours
plus $z'$ others, where $z'$ (the effective coordination
number) is two less than $z$ (the lattice coordination number).
Contact patterns are thought to be a unique representation
of compact conformations and we approximate
them as independent.
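
The contact-map definition above can be sketched in code. The following is a minimal illustration, assuming a conformation is supplied as a list of integer cubic-lattice coordinates in chain order; the function name `contact_map` and the 2 x 2 x 2 example walk are hypothetical and are not taken from this Letter.

```python
import numpy as np

def contact_map(coords):
    """Return the N x N contact matrix of a lattice conformation.

    coords: N integer lattice sites, listed in chain order.
    C[i, j] = 1 when monomers i and j occupy nearest-neighbour sites
    and are not adjacent along the chain; 0 otherwise.
    """
    r = np.asarray(coords, dtype=int)
    n = len(r)
    # Manhattan distance between every pair of monomers.
    dist = np.abs(r[:, None, :] - r[None, :, :]).sum(axis=2)
    idx = np.arange(n)
    chain_sep = np.abs(idx[:, None] - idx[None, :])
    # Lattice nearest neighbours, with chain neighbours excluded.
    return ((dist == 1) & (chain_sep > 1)).astype(int)

# Hypothetical example: a compact walk filling a 2 x 2 x 2 cube.
cube = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
        (0, 1, 1), (1, 1, 1), (1, 0, 1), (0, 0, 1)]
C = contact_map(cube)
print(C)
```

The matrix is symmetric with a zero diagonal, so each contact appears twice, once in each of the two rows it connects.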

Protein folding may be considered pattern recognition
inasmuch as the protein rapidly organises itself into
the target pattern $C$ upon entering the target basin of
attraction (funnel). By analogy with pattern association,
this idea may be generalised to the recognition of
multiple patterns. This raises the question of how to
train the sequence to recognise more than one conformation.
For lattice models, Shakhnovich and co-workers
have explored the folding of sequences designed to
minimise a conformation's absolute and relative energies.
The essence of the training technique is to embed the
protein into the target conformation and optimise stability
over sequence space; the resulting (near-optimal)
sequence spontaneously folds to the target. The dilute representation
of conformations by contact patterns suggests
that we can superimpose $p$ patterns without saturation,
providing us with a total pattern to which we train
in the usual way. This is essentially equivalent to the
method used to select bi-stable 36-mers in [10].
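
The training idea described above (superimpose the target contact maps, then optimise over sequence space) can be illustrated with a short sketch. Everything here is an assumption made for illustration: the random stand-in contact maps, the random symmetric pair potential, and the greedy single-site update rule are hypothetical and are not the optimisation procedure used by the authors. In particular, this sketch does not hold the composition fixed.

```python
import numpy as np

def random_contact_map(n, z_eff, rng):
    """Hypothetical stand-in for a compact conformation: a symmetric 0/1
    matrix with about z_eff contacts per monomer and none between chain
    neighbours. Real contact maps would come from lattice walks."""
    C = np.zeros((n, n), dtype=int)
    pairs = [(i, j) for i in range(n) for j in range(i + 2, n)]
    chosen = rng.choice(len(pairs), size=n * z_eff // 2, replace=False)
    for k in chosen:
        i, j = pairs[k]
        C[i, j] = C[j, i] = 1
    return C

def hamiltonian(C_tot, seq, U):
    """Half the sum of U[s_i, s_j] over all entries of the contact map."""
    return 0.5 * np.sum(C_tot * U[np.ix_(seq, seq)])

def train(C_tot, U, rng, max_sweeps=50):
    """Greedy training: set each monomer in turn to the species that
    lowers the Hamiltonian most, until a full sweep changes nothing."""
    n, A = C_tot.shape[0], U.shape[0]
    seq = rng.integers(A, size=n)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            # Energy of monomer i's contacts for each candidate species.
            local = U[:, seq] @ C_tot[i]
            best = int(np.argmin(local))
            if local[best] < local[seq[i]]:
                seq[i] = best
                changed = True
        if not changed:
            break
    return seq

rng = np.random.default_rng(1)
n, A, p = 60, 20, 2
G = rng.normal(size=(A, A))
U = (G + G.T) / np.sqrt(2.0)       # hypothetical symmetric random potential
maps = [random_contact_map(n, 4, rng) for _ in range(p)]
C_tot = sum(maps)                  # superposition of the p contact maps
seq = train(C_tot, U, rng)
per_target = [hamiltonian(C, seq, U) for C in maps]
print(per_target)
```

Because the Hamiltonian is linear in the contact map, its value on the superposition equals the sum of the energies in the individual targets, which is why a single optimisation can deepen several wells at once.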

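Energy Function. The energy of a sequence in conformation
$C$ may be conveniently expressed in terms of the extended pair
potential $\tilde{U}_{ij} = U_{S_i S_j}$ as

$$E = \frac{1}{2} \sum_{i,j=1}^{N} C_{ij} \tilde{U}_{ij}. \tag{1}$$

For a sequence trained to have minimal energy in conformation
$\Gamma_\mu$, the energy appears as

$$E_\mu^{\min} = \min_{\tilde{U}} \frac{1}{2} \sum_{i,j=1}^{N} C^{\mu}_{ij} \tilde{U}_{ij} = \frac{1}{2} \sum_{i,j=1}^{N} C^{\mu}_{ij} \tilde{U}^{*}_{ij}, \tag{2}$$

where minimisation is over all $\tilde{U}$ corresponding to valid
sequences and $\tilde{U}^{*}$ minimises $E_\mu$. The energy of a fixed
sequence $S_\nu$ folded to its ground state conformation is

$$E_\nu^{\min} = \min_{C} \frac{1}{2} \sum_{i,j=1}^{N} C_{ij} \tilde{U}^{\nu}_{ij} = \frac{1}{2} \sum_{i,j=1}^{N} C^{*}_{ij} \tilde{U}^{\nu}_{ij}, \tag{3}$$

where minimisation is over all $C$ corresponding to valid
conformations and $C^{*}$ minimises $E_\nu$. As is common usage,
we refer to the quantity $E_\nu$ for an untrained sequence
as the copolymer energy $E_{cp}$.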
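
As a concrete reading of the energy function, the sketch below evaluates the energy of a sequence in a given conformation. It assumes the convention that the energy is one half of the sum of $C_{ij} U_{S_i S_j}$ over all ordered pairs, so that each contact is counted once. The helper names and the toy numbers are hypothetical.

```python
import numpy as np

def extended_potential(seq, U):
    """N x N extended pair potential: entry (i, j) is U[seq[i], seq[j]]."""
    s = np.asarray(seq)
    return U[np.ix_(s, s)]

def energy(C, seq, U):
    """Energy of `seq` in the conformation with contact map C.
    The factor 1/2 compensates for each contact appearing twice in C."""
    return 0.5 * np.sum(C * extended_potential(seq, U))

# Toy example (all values hypothetical): 4 monomers, A = 2 species.
U = np.array([[-1.0, 0.5],
              [0.5, 0.0]])
C = np.zeros((4, 4), dtype=int)
C[0, 3] = C[3, 0] = 1               # a single contact, between monomers 0 and 3
print(energy(C, [0, 1, 1, 0], U))   # monomers 0 and 3 are both species 0
print(energy(C, [0, 1, 1, 1], U))   # monomer 3 switched to species 1
```

Training a sequence to a conformation amounts to searching over `seq` for fixed `C`; folding amounts to searching over valid `C` for fixed `seq`.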

Throughout this Letter, the energy of a sequence re- 
alised in a particular conformation is indicated by E, 
while the Hamiltonian with which a sequence is trained 
(generally the linear combination of the energies realised 
in a number of conformations) is denoted by H. 

Capacity from Energetics. We consider the thermodynamic
capacity of a protein, that is, the number of
conformations $p$ that we can train the sequence to make
simultaneously thermodynamically stable. For a protein
to fold to a single target conformation, it is necessary that
the energy of the trained sequence realised in that conformation,
$E_\mu^{\min}$, be below the minimum fluctuations of the
energy elsewhere, thereby making the target minimum
global. Since the trained sequence is not correlated with
distant conformations, energy fluctuations away from the
target structure are statistically equivalent to those of a
random copolymer sequence. We therefore require that
the trained energy be less than the minimum energy of a
random sequence, that is, $E_\mu^{\min} < E_{cp}^{\min}$. Folding to a set
of $p$ conformations requires that the minimum energy of
all of these lie below $E_{cp}^{\min}$.

We first estimate the typical minimum copolymer energy
$E_{cp}^{\min}$. Recalling that each row (or column) of the
contact map $C$ has $z'$ bonds, the quantity $E_{cp}$ from (3)
(before minimisation) is the sum of $Nz'/2$ bonds. Since
the extended pair potential $\tilde{U}$ of the copolymer from (3)
is untrained, these contact energies are uncorrelated and
may be considered random. Assuming a distribution of
bonds with zero mean (as is the case for that found in
[11]) and standard deviation $\sigma$, we find, in accordance
with the central limit theorem, that $E_{cp}$ is distributed as

$$f(E_{cp}) = \frac{1}{\sqrt{2\pi}\,\sigma_{cp}} \exp\left(-\frac{E_{cp}^2}{2\sigma_{cp}^2}\right), \tag{4}$$

where $\sigma_{cp}^2 = \frac{Nz'}{2}\sigma^2$. This estimation is valid out to $|E_{cp}|$
of order $\frac{Nz'}{2}\sigma$. The ground state energy $E_{cp}^{\min}$ is the least
of all possible samples of (4), each of which corresponds
to a unique conformation. Since the number of compact
conformations of an $N$-mer grows as $\kappa^N$, where $\kappa \approx 1.85$
on a cubic lattice [12], the energy of the ground state is
the minimum of $\kappa^N$ samples of $f(E_{cp})$.

What is the minimum of $M$ samples of a random variable
$X$ distributed according to a gaussian $g(x)$? For
convenience we assume zero mean and standard deviation
$\sigma_X$. The probability distribution of $x$ being the
minimum of $M$ samples of $X$ is given by

$$q_{\min}(x) = M g(x) \left[1 - G(x)\right]^{M-1}, \tag{5}$$

where $G(x) = \int_{-\infty}^{x} g(x')\,dx'$ is the usual cumulative distribution.
Maximising $q_{\min}$ with respect to $x$ yields the
transcendental equation $x_{\min} (1 - G(x_{\min})) = -\sigma_X^2 (M - 1) g(x_{\min})$,
where $x_{\min}$ is the minimum of the $M$ realisations
of $X$. For reasonably large $M$, $G(x)$ is small and
we estimate

$$x_{\min} \simeq -\sigma_X \sqrt{2 \ln M}. \tag{6}$$

By way of (6), we can express the ground state energy

$$E_{cp}^{\min} \simeq -\sigma_{cp} \sqrt{2 \ln \kappa^N} = -N \sigma \sqrt{z' \ln \kappa}. \tag{7}$$

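
The extreme-value estimate used here can be checked numerically. The sketch below assumes the leading-order form $x_{\min} \simeq -\sigma_X \sqrt{2 \ln M}$ for the typical minimum of $M$ gaussian samples and compares it with direct sampling; the function names, the number of trials and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

def predicted_minimum(M, sigma=1.0):
    """Leading-order estimate of the typical minimum of M gaussian samples."""
    return -sigma * np.sqrt(2.0 * np.log(M))

def sampled_minimum(M, sigma=1.0, trials=200, seed=0):
    """Mean, over `trials` repetitions, of the smallest of M zero-mean draws."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(0.0, sigma, size=(trials, M))
    return draws.min(axis=1).mean()

for M in (10**2, 10**3, 10**4):
    print(M, predicted_minimum(M), sampled_minimum(M))
```

At the modest values of $M$ accessible to direct sampling, the sampled minimum is somewhat shallower than the leading-order estimate, with the gap closing slowly as $M$ grows. The estimate should therefore be read as a scaling form rather than as an exact prefactor.
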
We now approximate the typical energy of a sequence
optimally trained to a set of $p$ target conformations and
arranged in one of these configurations. The total contact
map, to which we train by energy minimisation with
respect to the sequence, is defined as a linear superposition
of the $p$ corresponding contact maps, that is

$$C_{\mathrm{tot}} = \sum_{\mu=1}^{p} C^{\mu}. \tag{8}$$

The minimum Hamiltonian associated with the total
contact map may then be written

$$H_{\mathrm{tot}}^{\min} = \min_{\tilde{U}} \frac{1}{2} \sum_{i,j=1}^{N} C_{\mathrm{tot},ij} \tilde{U}_{ij} = \frac{1}{2} \sum_{i,j=1}^{N} C_{\mathrm{tot},ij} \tilde{U}^{*}_{ij}, \tag{9}$$

where here $\tilde{U}^{*}$ minimises $H_{\mathrm{tot}}$. It is simply the sum of
the $p$ individual conformational energies of the sequence
implied by $\tilde{U}^{*}$. We re-express the right side of (9) as the
sum over $i$ of the total energy associated with monomer $i$,
$H_{\mathrm{tot},i}$, each minimised with respect to the choice of amino
acid at monomer $i$,


$$H_{\mathrm{tot}}^{\min} \simeq \sum_{i=1}^{N} \min_{S_i} H_{\mathrm{tot},i}. \tag{10}$$



$H_{\mathrm{tot},i}$ is obtained by summing over the connections to
monomer $i$,

$$H_{\mathrm{tot},i} = \frac{1}{2} \sum_{j=1}^{N} C_{\mathrm{tot},ij} \tilde{U}_{ij}. \tag{11}$$

Since each $C^{\mu}$ has $z'$ bonds connecting to monomer $i$, each $H_{\mathrm{tot},i}$
is the sum of $pz'/2$ random interaction energies freely chosen
from the pair potential $U$. As before, we approximate
the distribution of $H_{\mathrm{tot},i}$ by its central limit theorem
form; it is a gaussian with variance $\sigma_{\mathrm{tot},i}^2 = \frac{pz'}{2}\sigma^2$. This
estimation is valid out to $|H_{\mathrm{tot},i}|$ of order $\frac{pz'}{2}\sigma$.

The Hamiltonian $H_{\mathrm{tot},i}$ at each monomer is minimised
with respect to the choice of amino acid by choosing the
smallest of $A$ samples from the distribution of $H_{\mathrm{tot},i}$;
again we wish to estimate the minimum of many samples
of a gaussian. By way of (6), we find that

$$H_{\mathrm{tot},i}^{\min} \simeq -\sigma \sqrt{p z' \ln A}. \tag{12}$$

When the trained sequence is in one of the $p$ target structures,
the average energy of the sequence is given by

$$E_\mu^{\min} = \frac{1}{p} H_{\mathrm{tot}}^{\min} \simeq -N \sigma \sqrt{\frac{z' \ln A}{p}}. \tag{13}$$

Equation (13) and results from simulation are plotted
in Figure 1 for $p = 1$. Apart from a prefactor of 0.847,
the predicted dependence of well depth on $A$ is in good
agreement with observation. Calculations for $p > 1$ are
ongoing and will be presented elsewhere.

Comparing the minimum copolymer energy (7) and the
minimum energy of the trained sequence (13) yields

$$p < \frac{\ln A}{\ln \kappa} \equiv p_{\max}. \tag{14}$$

Capacity from Information Theory. The thermodynamic
capacity of a protein may also be derived via
information theory. Consider the transmission of a message,
which has been encoded as an $N$ letter sequence.
The message is decoded empirically by constructing the
protein corresponding to the sequence (either in vitro or
via computer simulation), allowing it to fold and observing
the $p$ most occupied, and consequently lowest, target
conformations.

The information retrieved by learning a single conformation
may be determined as follows. Given $\kappa^N$ possible
compact conformations, the information contained
in one conformation is equivalent to the number of bits
necessary to express a number between 1 and $\kappa^N$, viz.,
$\log_2(\kappa^N)$. Since the $p$ target configurations are assumed
to be independent, the total retrieved information scales
linearly with $p$, that is, $I_r = pN \log_2 \kappa$.

The information transmitted may be similarly determined.
Since the number of sequences grows as $A^N$, the
information associated with a sequence is $\log_2(A^N)$, and
the total transmitted information is $I_t = N \log_2 A$.

Information theory dictates that the information retrieved
must not be greater than the information transmitted,
that is,

$$pN \log_2 \kappa \le N \log_2 A. \tag{15}$$

It readily follows that the bound on $p$ is

$$p \le \frac{\ln A}{\ln \kappa}, \tag{16}$$

which is identical to the result (14) deduced from fluctuations
in the energy landscape.
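
The information-theoretic bound reduces to a one-line calculation, since the chain length cancels and the base of the logarithm drops out of the ratio. The sketch below evaluates $\ln A / \ln \kappa$; the function name `capacity` is hypothetical, and the default $\kappa = 1.85$ is the cubic-lattice value quoted in the text.

```python
import math

def capacity(A, kappa=1.85):
    """Upper bound on the number of storable conformations, ln A / ln kappa.

    A: number of amino acid species.
    kappa: growth rate of the number of compact conformations per monomer.
    """
    return math.log(A) / math.log(kappa)

for A in (1, 2, 20):
    print(A, capacity(A))
```

The bound vanishes for a single species, lies between one and two for a binary alphabet, and grows only logarithmically with alphabet size. Its numerical value for a given $A$ is sensitive to the value of $\kappa$ adopted.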

Discussion of Capacity. Our bound on capacity has
been derived in two ways: by comparison of the trained 
and copolymer minimum energies, which depends on the 
method of training (in our case the superposition rule), 
and by an information theoretic argument, which does 
not. The equality of the two results suggests that our 
constant capacity result is not a shortcoming of the su- 
perposition rule. 

That our bound on memory is independent of chain
length $N$ may seem surprising given that the capacity of
a fully connected ANN grows linearly with the number
of neurons $n$. The resolution is that, in both cases, the
number of patterns which can be stored is of order the
number of connections divided by the number of nodes.
In the case of a protein the number of active connections
(contacts) is restricted to of order $N$, whereas for an ANN
all $n^2$ connections are allowed to contribute significantly.
The divisor arises because the amount of information in
a pattern is proportional to $N$ and $n$, respectively.

What happens to the protein energy landscape upon 
introducing further target conformations? Consider an 
energy landscape in which there lies a single well of maximal
depth. As a second (and, by assumption, independent)
well is introduced, the depth of the first is reduced
(Figure 2). As $p$ approaches $p_{\max}$, the typical well depth
diminishes such that, at $p = p_{\max}$, the minima are indistinguishable
from nearby fluctuations in the landscape.
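
The loss of well depth with increasing $p$ can be made quantitative under the scaling forms implied by the extreme-value estimates. As an assumption for illustration, take the trained well depth to scale as $\sqrt{z' \ln A / p}$ and the copolymer ground state as $\sqrt{z' \ln \kappa}$, both multiplied by the same factor $N\sigma$. Their ratio then depends only on $p$, $A$ and $\kappa$. The function name `relative_depth` is hypothetical.

```python
import math

def relative_depth(p, A, kappa=1.85):
    """Trained well depth divided by the typical copolymer ground-state depth,
    sqrt(ln A / (p ln kappa)). Values above 1 mean the target wells lie
    below the fluctuations of the surrounding landscape."""
    return math.sqrt(math.log(A) / (p * math.log(kappa)))

for p in range(1, 6):
    print(p, round(relative_depth(p, A=20), 3))
```

The ratio falls as $1/\sqrt{p}$ and reaches unity exactly when $p = \ln A / \ln \kappa$, the point at which the wells can no longer be told apart from the background.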

For a uniform composition (i.e., a homopolymer), zero
conformations are encodable, as expected. Frequently
studied binary models allow at most one configuration
to be stored, while for a 20 amino acid set, $p_{\max} = 4.67$.
In all cases, as $p$ approaches $p_{\max}$, the minima become
increasingly nominal. It may be possible to find a binary
(e.g., H-P) sequence with a global minimum above an
arbitrary compact target. But there is typically of order
one sequence per conformation, and the sequence is
statistically unlikely to be stable. In this sense, binary
models are not accurate representations of proteins.

Application to Heteropolymer Design and Prions.
Our results may be considered in the more general
context of heteropolymer engineering and rational
drug design. The ability to remember multiple confor- 
mations admits a potentially dramatic increase in the 
variety of heteropolymer function. We have provided ar- 
guments that training to superimposed contact maps pro- 
vides a viable method of designing multiply-conforming 
sequences. To what extent can we exercise control over 
their occupied conformations? 

Shakhnovich and co-workers [10] observed in simulation
what they refer to as kinetic partitioning: some sequences
designed to be stable in two conformations initially
fold to one structure before later folding to the
other. On time scales short by comparison, the distribu- 
tion over conformations occurs according to kinetic acces- 
sibility rather than conformational stability. We are in- 
vestigating the extent to which temperature can be used 
to effect a change of the dominant occupied conformation 
before the onset of equilibrium. 

A naturally occurring and much studied heteropolymer
thought to possess multiple stable conformations is
prion protein [16]. Prions are infectious, transmissible
pathogens composed exclusively of the modified protein
PrP$^{Sc}$. The chemical (primary) structure of PrP$^{Sc}$
is identical to the normal prion protein PrP$^{C}$, but its
conformation (tertiary structure) is significantly different.
Prion diseases, such as BSE, CJD and scrapie of
sheep, are believed to result from the conformational conversion
of PrP$^{C}$ to PrP$^{Sc}$ and the resulting accumulation
of the abnormal protein.

Our calculations support the view that prion disease 
is caused by misfolding to a second stable conformation. 
Far from being confined to particular or correlated struc- 
tures, the ability of a protein to take on multiple biolog- 
ically active conformations is ubiquitous. In addition to 
pathological proteins such as prions, we conjecture the 
existence of proteins which fold to multiple biologically 
useful conformations. Definitive observations to this end 
would have significant implications for our understanding
of protein function. 

FIG. 1. (Negative) square of protein stability as a
function of number of amino acid species $A$ (log-linear). Proteins
were trained to fold to a single $6 \times 6 \times 6$ conformation
with periodic boundary conditions by optimisation over sequence
space under constant composition. The dotted line
was generated by (13) at $p = 1$; introducing the prefactor 0.847
gives the solid line. Data are shown for $A$ = 9, 18, 36 and 72
species, for each of which the mean and standard deviation
were calculated from 12 runs with independent random pair
potentials.

FIG. 2. Energy landscapes of sequences trained to be thermodynamically
stable in one, two and $p_{\max} - 1$ target conformations
(axis label: Conformations). As the number of targets increases, the depth to
which the target wells can be trained diminishes. At $p = p_{\max}$
the wells are lost among nearby fluctuations.